This project is a case study of “Cyclistic”, a fictional bike-share company. The goal of this case study is to apply the data analysis process to answer a business question.
The context is as follows. Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. For that reason, the director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Our team therefore wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual riders into annual members. The director of marketing has assigned us the following question to answer: “How do annual members and casual riders use Cyclistic bikes differently?”
As the first step of the data analysis process, the Ask step is all about framing the project, to make sure our work stays focused and to the point.
First, we must identify the project’s stakeholders, meaning the people with a vested interest in its outcome. The key stakeholders are the company’s executive team, since they are the ones who will make a decision based on the data we give them. The secondary stakeholders are the marketing director and the rest of the marketing analytics team, with whom we will work to form recommendations for the marketing strategy to be adopted.
After that, we move to identifying the business task. In other words, we need to define the questions we want answered through our analysis to meet the needs of our stakeholders. In this case, our stakeholders want to convert casual riders into annual members. To do that, we must first understand how both of these segments behave and interact with the product, which involves answering a number of questions. The question we were tasked to answer is: “How do annual members and casual riders use Cyclistic bikes differently?”
Then we decide what type of data we need to answer it.
In this step, we are concerned with the collection and handling of the data.
We begin with collection. Given the context of the case, the data we need relates to customer usage, which means we can gather it using the company’s own resources, with no need for second- or third-party sources. This method of collection is the best possible, since it supplies data that is reliable, original (the source is easily accessible and verifiable), comprehensive, and current.
The actual raw data can be obtained here.
The handling of the data covers multiple aspects, starting with storage: the data will be stored on a local MySQL server, with a copy of the raw data kept as a backup. The data initially comes organized into smaller datasets split by month, and we made sure that each of them has the same structure.
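As an illustration, here is a minimal sketch of how one monthly file could be staged on the MySQL server. The table name, file name, and column layout are assumptions based on the publicly available Cyclistic/Divvy trip data; adjust them to match the actual files.

```sql
-- Staging table for one monthly extract (names and types are assumptions
-- based on the public Cyclistic/Divvy CSV layout).
CREATE TABLE trips_2021_01 (
    ride_id            VARCHAR(50),
    rideable_type      VARCHAR(30),
    started_at         DATETIME,
    ended_at           DATETIME,
    start_station_name VARCHAR(100),
    start_station_id   VARCHAR(50),
    end_station_name   VARCHAR(100),
    end_station_id     VARCHAR(50),
    start_lat          DECIMAL(9,6),
    start_lng          DECIMAL(9,6),
    end_lat            DECIMAL(9,6),
    end_lng            DECIMAL(9,6),
    member_casual      VARCHAR(10)
);

-- Bulk-load the raw CSV; local_infile must be enabled on client and server.
-- No key constraints yet: duplicates are screened during cleaning.
LOAD DATA LOCAL INFILE '202101-divvy-tripdata.csv'
INTO TABLE trips_2021_01
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
```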
As for security, access to the server is password-protected. On top of that, all personally identifiable data (names, addresses, billing information, etc.) was scrubbed from the dataset.
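To make the password-protected access concrete, the sketch below creates a least-privilege MySQL account. The user name, password, and database name are placeholders, not the actual credentials.

```sql
-- Placeholders only: a password-protected, least-privilege account
-- for the analysis database.
CREATE USER 'analyst'@'localhost' IDENTIFIED BY 'choose_a_strong_password';
GRANT SELECT, INSERT, UPDATE, DELETE ON cyclistic.* TO 'analyst'@'localhost';
```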
As for the license, the dataset is provided under a non-exclusive, royalty-free, limited, perpetual license to access, reproduce, analyze, copy, modify, and distribute the data in a product or service, and to use it for any lawful purpose.
This is the step where we clean the data of any anomalies that may affect the accuracy of our results, and ensure its integrity and usefulness. For this, we will use a combination of Microsoft Excel and MySQL at different steps of the process. We will run the same cleaning procedure on each file.
The types of dirty data we screen for are:
When it comes to working with large amounts of data, SQL is the tool for the job. It offers multiple features that help streamline the data cleaning process. For this project, we will use MySQL to upload the data into our database and interact with it. The steps taken to clean the data are as follows:
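To illustrate the kind of queries these steps rely on, here is a hedged sketch of some typical checks, assuming the staging layout shown earlier. The table name is illustrative, and the actual queries used in this project are linked further below.

```sql
-- 1. Flag duplicate ride IDs.
SELECT ride_id, COUNT(*) AS n
FROM trips_2021_01
GROUP BY ride_id
HAVING COUNT(*) > 1;

-- 2. Count rows missing core fields.
SELECT COUNT(*) AS missing_core_fields
FROM trips_2021_01
WHERE started_at IS NULL
   OR ended_at IS NULL
   OR member_casual IS NULL;

-- 3. Remove trips with a non-positive duration (end before start).
DELETE FROM trips_2021_01
WHERE ended_at <= started_at;

-- 4. Normalize stray whitespace in the rider-type label.
UPDATE trips_2021_01
SET member_casual = TRIM(member_casual);
```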
Note that the cleaning process revealed that the data related to the different stations is too dirty to be cleaned effectively or used in its original state. Some of the problems noticed in the data are:
Under normal circumstances, this would be raised with management to work out a solution, but since we have no alternative source for this analysis, we decided to remove the station names, IDs, and longitude/latitude values, and focus on the remaining fields in our analysis.
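A sketch of what this removal might look like in MySQL, assuming the column names from the staging layout above (the actual query is linked below):

```sql
-- Drop the unusable station-related fields from the staging table.
ALTER TABLE trips_2021_01
    DROP COLUMN start_station_name,
    DROP COLUMN start_station_id,
    DROP COLUMN end_station_name,
    DROP COLUMN end_station_id,
    DROP COLUMN start_lat,
    DROP COLUMN start_lng,
    DROP COLUMN end_lat,
    DROP COLUMN end_lng;
```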
The SQL query used for this procedure can be found here.
After cleaning the data, it’s time for analysis. In this step, we try to detect trends within the data. For this purpose, we use the pivot table feature of the spreadsheet program Google Sheets. After exporting the clean data from our database, we use Sheets functions and formulas to produce the following calculated fields:
Next come the pivot tables:
These results are grouped in the extracted data.xlsx file.
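As a side note, the same calculated fields and pivot-style summaries could be reproduced directly in SQL before exporting. A minimal sketch, assuming the staging table described earlier:

```sql
-- Rides and average ride length per rider type and weekday,
-- mirroring the spreadsheet pivot tables (table name illustrative).
SELECT
    member_casual,
    DAYNAME(started_at) AS day_of_week,
    COUNT(*) AS rides,
    ROUND(AVG(TIMESTAMPDIFF(MINUTE, started_at, ended_at)), 1) AS avg_ride_length_min
FROM trips_2021_01
GROUP BY member_casual, DAYNAME(started_at)
ORDER BY member_casual, rides DESC;
```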
In this step, we use Tableau to build a dashboard from the data we extracted, making it easy to spot trends and determine the differences between casual riders and members.
The dashboard we created is composed of the following elements:
The dashboard can be accessed using the following link.
From the data we obtained using pivot tables, together with the visualizations we built in Tableau, we can draw the following conclusions:
Based on all of the above, we recommend the following: